DENSITY ESTIMATION WITH QUADRATIC LOSS: A 
CONFIDENCE INTERVALS METHOD 



PIERRE ALQUIER 

Abstract. In [1], a least square regression estimation procedure was proposed: first, we consider a family of functions $f_k$ and study the properties of an estimator in every unidimensional model $\{\alpha f_k, \alpha \in \mathbb{R}\}$; we then show how to aggregate these estimators. The purpose of this paper is to extend this method to the case of density estimation. We first give a general overview of the method, adapted to the density estimation problem. We then show that this leads to adaptive estimators, meaning that the estimator reaches the best possible rate of convergence (up to a log factor). Finally we show some ways to improve and generalize the method.



1. Introduction: the density estimation setting 

Let us assume that we are given a measure space $(\mathcal{X}, \mathcal{B}, \lambda)$ where $\lambda$ is positive and $\sigma$-finite, and a probability measure $P$ on $(\mathcal{X}, \mathcal{B})$ such that $P$ has a density with respect to $\lambda$:

$$P(dx) = f(x)\lambda(dx).$$

We assume that we observe a realization of the canonical process $(X_1, \dots, X_N)$ on $(\mathcal{X}^N, \mathcal{B}^{\otimes N}, P^{\otimes N})$. Our objective here is to estimate $f$ on the basis of the observations $X_1, \dots, X_N$.

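More precisely, let $\mathcal{L}^2(\mathcal{X}, \lambda)$ denote the set of all measurable functions from $(\mathcal{X}, \mathcal{B})$ to $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$ where $\mathcal{B}_{\mathbb{R}}$ is the Borel $\sigma$-algebra on $\mathbb{R}$. We will write $\mathcal{L}^2(\mathcal{X}, \lambda) = \mathcal{L}^2$ for short. Remark that $f \in \mathcal{L}^2$. Let us put, for any $(g, h) \in (\mathcal{L}^2)^2$:

$$d^2(g,h) = \int_{\mathcal{X}} \left(g(x) - h(x)\right)^2 \lambda(dx),$$

and let $\|.\|$ and $\langle ., . \rangle$ denote the corresponding norm and scalar product. We are here looking for an estimator $\hat{f}$ that tries to minimize our objective, the quadratic loss $d^2(\hat{f}, f)$.

Let us choose an integer $m \in \mathbb{N}$ and a family of functions $(f_1, \dots, f_m) \in (\mathcal{L}^2)^m$. There is no particular assumption about this family: it is not necessarily linearly independent, for example.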

First, we are going to study estimators of $f$ in every unidimensional model $\{\alpha f_k(.), \alpha \in \mathbb{R}\}$ (as done in [1]). Usually these models are too small and the obtained estimators do not have good properties. We then propose an iterative method that selects and aggregates such estimators in order to build a suitable estimator of $f$ (section 2).

In section 3 we study the rate of convergence of the obtained estimator in a particular case.

In section 4 we propose several improvements and generalizations of the method. Finally, in section 5 we make some simulations in order to compare the practical performances of our estimator with other ones.



2. Estimation method 

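2.1. Hypothesis. In this section we will use a particular hypothesis about $f$ and/or the basis functions $f_k$, $k \in \{1, \dots, m\}$.

Definition 2.1. We will say that $f$ and $(f_1, \dots, f_m)$ satisfy the condition $\mathcal{H}(p)$ for $1 < p < +\infty$ if, for:

$$\frac{1}{p} + \frac{1}{q} = 1,$$

there exist some $(c, c_1, \dots, c_m) \in (\mathbb{R}_+^*)^{m+1}$ (known to the statistician) such that:

$$\forall k \in \{1, \dots, m\}, \quad \left( \int_{\mathcal{X}} |f_k|^{2p} \lambda(dx) \right)^{\frac{1}{p}} \leq c_k \int_{\mathcal{X}} |f_k|^2 \lambda(dx)$$

and

$$\left( \int_{\mathcal{X}} |f|^q \lambda(dx) \right)^{\frac{1}{q}} \leq c \int_{\mathcal{X}} |f| \lambda(dx) \quad (= c).$$

For $p = 1$ the condition $\mathcal{H}(1)$ is: $f$ is bounded by a (known) constant $c$ and we put $c_1 = \dots = c_m = 1$. For $p = +\infty$ the condition $\mathcal{H}(+\infty)$ is just that every $|f_k|$ is bounded by

$$\sqrt{c_k \int_{\mathcal{X}} f_k(x)^2 \lambda(dx)}$$

where $c_k$ is known, and we put $c = 1$. In any case, we put, for any $k$:

$$C_k = c_k c.$$

Definition 2.2. We put, for any $k \in \{1, \dots, m\}$:

$$D_k = \int_{\mathcal{X}} |f_k|^2 \lambda(dx) = d^2(f_k, 0) = \|f_k\|^2.$$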

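2.2. Unidimensional models. Let us choose $k \in \{1, \dots, m\}$ and consider the unidimensional model $\mathcal{M}_k = \{\alpha f_k(.), \alpha \in \mathbb{R}\}$. Remark that the orthogonal projection (denoted by $\Pi_{\mathcal{M}_k}$) of $f$ on $\mathcal{M}_k$ has a known expression, namely:

$$\Pi_{\mathcal{M}_k} f(.) = \overline{\alpha}_k f_k(.)$$

where:

$$\overline{\alpha}_k = \arg\min_{\alpha \in \mathbb{R}} d^2(\alpha f_k, f) = \frac{\int_{\mathcal{X}} f_k(x) f(x) \lambda(dx)}{\int_{\mathcal{X}} f_k(x)^2 \lambda(dx)} = \frac{\int_{\mathcal{X}} f_k(x) f(x) \lambda(dx)}{D_k}.$$

A natural estimator of this coefficient is:

$$\hat{\alpha}_k = \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(X_i)}{\int_{\mathcal{X}} f_k(x)^2 \lambda(dx)} = \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(X_i)}{D_k}$$

because we expect to have, by the law of large numbers:

$$\frac{1}{N} \sum_{i=1}^{N} f_k(X_i) \longrightarrow P[f_k(X)] = \int_{\mathcal{X}} f_k(x) f(x) \lambda(dx).$$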
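Actually, we can formulate a more precise result.

Theorem 2.1. Let us assume that condition $\mathcal{H}(p)$ holds for some $p \in [1, +\infty]$. Then for any $\varepsilon > 0$ we have:

$$P^{\otimes N} \left\{ \forall k \in \{1, \dots, m\}, \quad d^2(\hat{\alpha}_k f_k, \overline{\alpha}_k f_k) \leq \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N} \left[ \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2}{D_k} + C_k \right] \right\} \geq 1 - \varepsilon.$$

The proof is given at the end of the section.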
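
As an illustration, the estimator of the coefficient and the upper bound of theorem 2.1 can be computed from a sample in a few lines. This is only a sketch under assumptions of ours: the function name `coefficient_and_bound` and its arguments are hypothetical, `fk_values` is assumed to hold the values $f_k(X_1), \dots, f_k(X_N)$, and `d_k` and `cap_c_k` stand for the known constants $D_k$ and $C_k$.

```python
import numpy as np

def coefficient_and_bound(fk_values, d_k, cap_c_k, m, eps):
    """Return (alpha_hat_k, bound) for one model k (hypothetical helper).

    alpha_hat_k = (1/N) sum_i f_k(X_i) / D_k
    bound       = 4 [1 + log(2m/eps)] / N * ( (1/N) sum_i f_k(X_i)^2 / D_k + C_k )
    """
    fk_values = np.asarray(fk_values, dtype=float)
    n = fk_values.size
    alpha_hat = fk_values.mean() / d_k
    log_term = 1.0 + np.log(2.0 * m / eps)
    bound = 4.0 * log_term / n * (np.mean(fk_values ** 2) / d_k + cap_c_k)
    return alpha_hat, bound
```

The bound decreases like $1/N$ up to the logarithmic term, and depends on the data only through the empirical second moment of $f_k(X)$.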



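2.3. The selection algorithm. Until the end of this section we assume that $\mathcal{H}(p)$ is satisfied for some $1 \leq p \leq +\infty$.

Let $\beta(\varepsilon, k)$ denote the upper bound for the model $k$ in theorem 2.1:

$$\forall \varepsilon > 0, \forall k \in \{1, \dots, m\}: \quad \beta(\varepsilon, k) = \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N} \left[ \frac{\frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2}{D_k} + C_k \right].$$

Let us put:

$$\mathcal{CR}_{k,\varepsilon} = \left\{ g \in \mathcal{L}^2, \ d^2(\hat{\alpha}_k f_k, \Pi_{\mathcal{M}_k} g) \leq \beta(\varepsilon, k) \right\}.$$

Then theorem 2.1 implies the following result.

Corollary 2.2. For any $\varepsilon > 0$ we have:

$$P^{\otimes N} \left[ \forall k \in \{1, \dots, m\}, \ f \in \mathcal{CR}_{k,\varepsilon} \right] \geq 1 - \varepsilon.$$

So for any $k$, $\mathcal{CR}_{k,\varepsilon}$ is a confidence region at level $\varepsilon$ for $f$. Moreover, $\mathcal{CR}_{k,\varepsilon}$ being convex we have the following corollary.

Corollary 2.3. For any $\varepsilon > 0$ we have:

$$P^{\otimes N} \left[ \forall k \in \{1, \dots, m\}, \forall g \in \mathcal{L}^2, \ d^2(\Pi_{\mathcal{CR}_{k,\varepsilon}} g, f) \leq d^2(g, f) \right] \geq 1 - \varepsilon.$$

It just means that for any $g$, $\Pi_{\mathcal{CR}_{k,\varepsilon}} g$ is a better estimator than $g$.

So we propose the following algorithm (generic form):

• we choose $\varepsilon$ and start with $g_0 = 0$;

• at each step $n$, we choose a model $\mathcal{M}_{k(n)}$ where $k(n) \in \{1, \dots, m\}$ can be chosen in any way we want (it can of course depend on the data) and take:

$$g_{n+1} = \Pi_{\mathcal{CR}_{k(n),\varepsilon}} g_n;$$

• we choose a stopping time $n_s$ in any way we want and take:

$$\hat{f} = g_{n_s}.$$

So corollary 2.3 implies that:

$$P^{\otimes N} \left[ d^2(\hat{f}, f) = d^2(g_{n_s}, f) \leq \dots \leq d^2(g_0, f) = d^2(0, f) \right] \geq 1 - \varepsilon.$$

Actually, a more accurate version of corollary 2.3 can give an idea of the way to choose $k(n)$ in the algorithm. Let us use corollary 2.2 and remember the fact that each $\mathcal{CR}_{k,\varepsilon}$ is convex.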

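Corollary 2.4. For any $\varepsilon > 0$ we have:

$$P^{\otimes N} \left[ \forall k \in \{1, \dots, m\}, \forall g \in \mathcal{L}^2, \ d^2(\Pi_{\mathcal{CR}_{k,\varepsilon}} g, f) \leq d^2(g, f) - d^2(\Pi_{\mathcal{CR}_{k,\varepsilon}} g, g) \right] \geq 1 - \varepsilon.$$

So we propose the following version of our previous algorithm (this is not necessarily the best choice!):

• we choose $\varepsilon$ and $0 < \kappa \leq 1/N$ and start with $g_0 = 0$;

• at each step $n$, we take:

$$k(n) = \arg\max_{k \in \{1, \dots, m\}} d^2(\Pi_{\mathcal{CR}_{k,\varepsilon}} g_n, g_n)$$

and:

$$g_{n+1} = \Pi_{\mathcal{CR}_{k(n),\varepsilon}} g_n;$$

• we take:

$$n_s = \inf\left\{ n \in \mathbb{N}: \ d^2(g_n, g_{n-1}) \leq \kappa \right\}$$

and:

$$\hat{f} = g_{n_s}.$$

So corollary 2.4 implies that:

$$P^{\otimes N} \left[ d^2(\hat{f}, f) \leq d^2(0, f) - \sum_{n=0}^{n_s - 1} d^2(g_{n+1}, g_n) \right] \geq 1 - \varepsilon.$$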
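
The greedy version of the algorithm can be sketched numerically as follows. The representation is an assumption of ours, not part of the text: the current function $g$ is stored through its coefficients `gamma` on $(f_1, \dots, f_m)$, and `gram` is the (assumed known) matrix of scalar products $\langle f_j, f_k \rangle$, so that $D_k$ is `gram[k, k]`. Each confidence region is then a slab on which the projection only moves the $k$-th coefficient. The names `project_on_region` and `greedy_projections` are hypothetical.

```python
import numpy as np

def project_on_region(gamma, k, gram, alpha_hat, beta):
    """Project g = sum_j gamma[j] f_j on the k-th confidence region.

    Returns the new coefficients and the squared distance moved.
    """
    d_k = gram[k, k]
    a = gram[k] @ gamma / d_k                 # coefficient of the projection of g on M_k
    radius = np.sqrt(beta[k] / d_k)           # half-width of the slab around alpha_hat[k]
    target = np.clip(a, alpha_hat[k] - radius, alpha_hat[k] + radius)
    new_gamma = gamma.copy()
    new_gamma[k] += target - a                # move along f_k only
    return new_gamma, (target - a) ** 2 * d_k

def greedy_projections(gram, alpha_hat, beta, kappa, max_steps=10_000):
    """Start from g_0 = 0, always take the projection that moves g the most,
    and stop as soon as the squared move is at most kappa."""
    m = len(alpha_hat)
    gamma = np.zeros(m)
    for _ in range(max_steps):
        candidates = [project_on_region(gamma, k, gram, alpha_hat, beta)
                      for k in range(m)]
        best = max(range(m), key=lambda k: candidates[k][1])
        gamma, move = candidates[best]
        if move <= kappa:
            break
    return gamma
```

With an orthonormal family (identity Gram matrix) this sketch reduces to a soft thresholding of the estimated coefficients, each coefficient being visited at most once.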

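2.4. Remarks on the intersection of the confidence regions. Actually, corollary 2.2 could motivate another method. Note that:

$$\forall k \in \{1, \dots, m\}, \ f \in \mathcal{CR}_{k,\varepsilon} \quad \Leftrightarrow \quad f \in \bigcap_{k=1}^{m} \mathcal{CR}_{k,\varepsilon}.$$

Let us put, for any $I \subset \{1, \dots, m\}$:

$$\mathcal{CR}_{I,\varepsilon} = \bigcap_{k \in I} \mathcal{CR}_{k,\varepsilon},$$

and:

$$\hat{f}_I = \Pi_{\mathcal{CR}_{I,\varepsilon}} 0.$$

Then $\mathcal{CR}_{I,\varepsilon}$ is still a convex region that contains $f$ and is a subset of every $\mathcal{CR}_{k,\varepsilon}$ for $k \in I$. So we have the following result.

Corollary 2.5. For any $\varepsilon > 0$:

$$P^{\otimes N} \left[ \forall I \subset \{1, \dots, m\}, \forall k \in I, \ d(\hat{f}_{\{1,\dots,m\}}, f) \leq d(\hat{f}_I, f) \leq d(\Pi_{\mathcal{CR}_{k,\varepsilon}} 0, f) \right] \geq 1 - \varepsilon.$$

In the case where we are interested in "model selection type aggregation" of estimators, note that, with probability at least $1 - \varepsilon$:

$$d(\Pi_{\mathcal{CR}_{k,\varepsilon}} 0, f) \leq d(\Pi_{\mathcal{CR}_{k,\varepsilon}} 0, \hat{\alpha}_k f_k) + d(\hat{\alpha}_k f_k, f) \leq \beta(\varepsilon, k) + d(f_k, f).$$

So we have the following result.

Corollary 2.6. For any $\varepsilon > 0$:

$$P^{\otimes N} \left[ d(\hat{f}_{\{1,\dots,m\}}, f) \leq \inf_{k \in \{1, \dots, m\}} \left[ d(f_k, f) + \beta(\varepsilon, k) \right] \right] \geq 1 - \varepsilon.$$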

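The estimator $\hat{f}_{\{1,\dots,m\}}$ can be reached by solving the following optimization problem:

$$\min_{g} \|g\|^2 \quad \text{s.t.} \quad \forall k \in \{1, \dots, m\}: \quad \begin{cases} \langle g - \hat{\alpha}_k f_k, f_k \rangle - \sqrt{D_k \beta(\varepsilon, k)} \leq 0, \\ - \langle g - \hat{\alpha}_k f_k, f_k \rangle - \sqrt{D_k \beta(\varepsilon, k)} \leq 0. \end{cases}$$

The problem can be solved in dual form:

$$\max_{\gamma \in \mathbb{R}^m} \left[ - \sum_{i=1}^{m} \sum_{k=1}^{m} \gamma_i \gamma_k \langle f_i, f_k \rangle + 2 \sum_{k=1}^{m} \gamma_k \hat{\alpha}_k \|f_k\|^2 - 2 \sum_{k=1}^{m} |\gamma_k| \sqrt{D_k \beta(\varepsilon, k)} \right]$$

with solution $\gamma^* = (\gamma_1^*, \dots, \gamma_m^*)$ and:

$$\hat{f}_{\{1,\dots,m\}} = \sum_{k=1}^{m} \gamma_k^* f_k.$$

As:

$$\sum_{i=1}^{m} \sum_{k=1}^{m} \gamma_i \gamma_k \langle f_i, f_k \rangle = \left\| \sum_{k=1}^{m} \gamma_k f_k \right\|^2$$

and:

$$\sum_{k=1}^{m} \gamma_k \hat{\alpha}_k \|f_k\|^2 = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{m} \gamma_k f_k(X_i),$$

we can see this as a penalized maximization of the likelihood.

We can note that it is easier and computationally more efficient to project successively on every region $\mathcal{CR}_{k,\varepsilon}$ than to project once on $\mathcal{CR}_{\{1,\dots,m\},\varepsilon}$.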

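2.5. An example: the histogram. Let us assume that $\lambda$ is a finite measure and let $A_1, \dots, A_m$ be a partition of $\mathcal{X}$. We put, for any $k \in \{1, \dots, m\}$:

$$f_k(.) = \mathbb{1}_{A_k}(.).$$

Remark that:

$$\int_{\mathcal{X}} f_k(x)^2 \lambda(dx) = \lambda(A_k),$$

and that condition $\mathcal{H}(+\infty)$ is satisfied with constants:

$$c_k = \frac{1}{\lambda(A_k)}$$

and (as we have the convention $c = 1$ in this case) $C_k = c_k c = c_k$.

In this context we have:

$$\overline{\alpha}_k = \frac{P(X \in A_k)}{\lambda(A_k)}, \qquad \hat{\alpha}_k = \frac{\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{A_k}(X_i)}{\lambda(A_k)}, \qquad \beta(\varepsilon, k) = \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N \lambda(A_k)} \left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + 1 \right].$$

Finally, note that the confidence regions $\mathcal{CR}_{k,\varepsilon}$ are all orthogonal in this case. So the order of projection does not affect the obtained estimator here, and we can take:

$$\hat{f} = \Pi_{\mathcal{CR}_{m,\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0$$

(and note that $\hat{f} = \hat{f}_{\{1,\dots,m\}}$ here, following the notations of subsection 2.4). We have:

$$\hat{f}(x) = \sum_{k=1}^{m} \left( \hat{\alpha}_k - \sqrt{\frac{\beta(\varepsilon, k)}{\lambda(A_k)}} \right)_+ f_k(x)$$

where, for any $y \in \mathbb{R}$:

$$(y)_+ = \begin{cases} y & \text{if } y \geq 0 \\ 0 & \text{otherwise.} \end{cases}$$

In this case corollary 2.4 becomes:

$$P^{\otimes N} \left[ d^2(\hat{f}, f) \leq d^2(0, f) - \sum_{k=1}^{m} \left( \hat{\alpha}_k - \sqrt{\frac{\beta(\varepsilon, k)}{\lambda(A_k)}} \right)_+^2 \lambda(A_k) \right] \geq 1 - \varepsilon.$$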
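
The histogram case is simple enough to be written down directly. The following sketch makes assumptions of ours: the observations are real-valued, the cells $A_k$ are consecutive intervals given by their boundaries `edges`, $\lambda$ is the Lebesgue measure (so that $\lambda(A_k)$ is a cell width), and the function name `thresholded_histogram` is hypothetical. It uses the fact that $f_k^2 = f_k$ for an indicator function, and shrinks each estimated height by $\sqrt{\beta(\varepsilon,k)/\lambda(A_k)}$ as in section 2.3.

```python
import numpy as np

def thresholded_histogram(samples, edges, eps):
    """Heights of the thresholded histogram on the cells defined by `edges`."""
    samples = np.asarray(samples, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n = samples.size
    m = edges.size - 1
    widths = np.diff(edges)                        # lambda(A_k)
    counts, _ = np.histogram(samples, bins=edges)
    freq = counts / n                              # (1/N) sum_i 1_{A_k}(X_i)
    alpha_hat = freq / widths
    log_term = 1.0 + np.log(2.0 * m / eps)
    beta = 4.0 * log_term / (n * widths) * (freq + 1.0)
    # soft thresholding of each height, truncated at zero
    return np.maximum(alpha_hat - np.sqrt(beta / widths), 0.0)
```

Cells with few observations are set to zero, while well-populated cells keep a height slightly below the usual histogram height.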

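2.6. Proof of the theorem. Before giving the proof, let us state two lemmas that we will use in the proof. The first one is a variant of a lemma by Catoni, the second one is due to Panchenko.

Lemma 2.7. Let $(T_1, \dots, T_{2N})$ be a random vector taking values in $\mathbb{R}^{2N}$ distributed according to a distribution $\mathcal{P}^{\otimes 2N}$. For any measurable function $\eta$ taking values in $\mathbb{R}$, for any measurable function $\lambda: \mathbb{R}^{2N} \rightarrow \mathbb{R}_+^*$ that is exchangeable with respect to its $2N$ arguments:

$$\mathcal{P}^{\otimes 2N} \exp\left( \frac{\lambda}{N} \sum_{i=1}^{N} \{T_{i+N} - T_i\} - \frac{\lambda^2}{N^2} \sum_{i=1}^{2N} T_i^2 - \eta \right) \leq \exp(-\eta)$$

and the reverse inequality:

$$\mathcal{P}^{\otimes 2N} \exp\left( \frac{\lambda}{N} \sum_{i=1}^{N} \{T_i - T_{i+N}\} - \frac{\lambda^2}{N^2} \sum_{i=1}^{2N} T_i^2 - \eta \right) \leq \exp(-\eta),$$

where we write:

$$\eta = \eta(T_1, \dots, T_{2N}), \qquad \lambda = \lambda(T_1, \dots, T_{2N})$$

for short.

Proof of lemma 2.7. In order to prove the first inequality, we write:

$$\mathcal{P}^{\otimes 2N} \exp\left( \frac{\lambda}{N} \sum_{i=1}^{N} \{T_{i+N} - T_i\} - \frac{\lambda^2}{N^2} \sum_{i=1}^{2N} T_i^2 - \eta \right) = \mathcal{P}^{\otimes 2N} \exp\left( \sum_{i=1}^{N} \log\cosh\left\{ \frac{\lambda}{N} (T_{i+N} - T_i) \right\} - \frac{\lambda^2}{N^2} \sum_{i=1}^{2N} T_i^2 - \eta \right).$$

We now use the inequality:

$$\forall x \in \mathbb{R}, \quad \log\cosh x \leq \frac{x^2}{2}.$$

We obtain:

$$\log\cosh\left\{ \frac{\lambda}{N} (T_{i+N} - T_i) \right\} \leq \frac{\lambda^2}{2N^2} (T_{i+N} - T_i)^2 \leq \frac{\lambda^2}{N^2} \left( T_{i+N}^2 + T_i^2 \right).$$

The proof for the reverse inequality is exactly the same. □

Lemma 2.8 (Panchenko, corollary 1). Let us assume that we have i.i.d. variables $T_1, \dots, T_N$ (with distribution $\mathcal{P}$ and values in $\mathbb{R}$) and an independent copy $T' = (T_{N+1}, \dots, T_{2N})$ of $T = (T_1, \dots, T_N)$. Let $\xi_j(T, T')$ for $j \in \{1, 2, 3\}$ be three measurable functions taking values in $\mathbb{R}$, and $\xi_3 \geq 0$. Let us assume that we know two constants $A \geq 1$ and $a > 0$ such that, for any $u > 0$:

$$\mathcal{P}^{\otimes 2N} \left[ \xi_1(T, T') \geq \xi_2(T, T') + \sqrt{\xi_3(T, T') u} \right] \leq A \exp(-au).$$

Then, for any $u > 0$:

$$\mathcal{P}^{\otimes 2N} \left\{ \mathcal{P}^{\otimes 2N}[\xi_1(T, T') | T] \geq \mathcal{P}^{\otimes 2N}[\xi_2(T, T') | T] + \sqrt{\mathcal{P}^{\otimes 2N}[\xi_3(T, T') | T] u} \right\} \leq A \exp(1 - au).$$

The proof of this lemma can be found in Panchenko's paper. We can now give the proof of theorem 2.1.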

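Proof of theorem 2.1. Let $(X_{N+1}, \dots, X_{2N})$ be an independent copy of our sample $(X_1, \dots, X_N)$. Let us choose $k \in \{1, \dots, m\}$. Let us apply lemma 2.7 with $\mathcal{P} = P$ and, for any $i \in \{1, \dots, 2N\}$:

$$T_i = f_k(X_i).$$

We obtain, for any measurable function $\eta_k$ taking values in $\mathbb{R}$, for any measurable function $\lambda_k: \mathcal{X}^{2N} \rightarrow \mathbb{R}_+^*$ that is exchangeable with respect to its $2N$ arguments:

$$P^{\otimes 2N} \exp\left( \frac{\lambda_k}{N} \sum_{i=1}^{N} \{f_k(X_i) - f_k(X_{i+N})\} - \frac{\lambda_k^2}{N^2} \sum_{i=1}^{2N} f_k(X_i)^2 - \eta_k \right) \leq \exp(-\eta_k)$$

and the reverse inequality as well. This implies that:

$$P^{\otimes 2N} \left[ \frac{1}{N} \sum_{i=1}^{N} \{f_k(X_i) - f_k(X_{i+N})\} \geq \frac{\lambda_k}{N^2} \sum_{i=1}^{2N} f_k(X_i)^2 + \frac{\eta_k}{\lambda_k} \right] \leq \exp(-\eta_k)$$

and:

$$P^{\otimes 2N} \left[ \frac{1}{N} \sum_{i=1}^{N} \{f_k(X_{i+N}) - f_k(X_i)\} \geq \frac{\lambda_k}{N^2} \sum_{i=1}^{2N} f_k(X_i)^2 + \frac{\eta_k}{\lambda_k} \right] \leq \exp(-\eta_k).$$

Let us choose:

$$\lambda_k = \sqrt{\frac{N \eta_k}{\frac{1}{N} \sum_{i=1}^{2N} f_k(X_i)^2}}$$

in both inequalities; we obtain for the first one:

$$P^{\otimes 2N} \left[ \frac{1}{N} \sum_{i=1}^{N} \{f_k(X_i) - f_k(X_{i+N})\} \geq 2 \sqrt{\frac{\eta_k \frac{1}{N} \sum_{i=1}^{2N} f_k(X_i)^2}{N}} \right] \leq \exp(-\eta_k).$$

We now apply lemma 2.8 with the same $T_i = f_k(X_i)$, $\eta_k = u$, $A = 1$, $a = 1$, $\xi_2 = 0$,

$$\xi_1 = \frac{1}{N} \sum_{i=1}^{N} \{f_k(X_i) - f_k(X_{i+N})\} \quad \text{and} \quad \xi_3 = \frac{4}{N^2} \sum_{i=1}^{2N} f_k(X_i)^2.$$

We obtain:

$$P^{\otimes N} \left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i) - P[f_k(X)] \geq 2 \sqrt{\frac{\eta_k \left\{ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + P[f_k(X)^2] \right\}}{N}} \right] \leq \exp(1 - \eta_k).$$

Remark that:

$$P[f_k(X)^2] = \int_{\mathcal{X}} f_k(x)^2 f(x) \lambda(dx).$$

So, using condition $\mathcal{H}(p)$ and Hölder's inequality we have:

$$P[f_k(X)^2] \leq \left( \int_{\mathcal{X}} |f_k(x)|^{2p} \lambda(dx) \right)^{\frac{1}{p}} \left( \int_{\mathcal{X}} f(x)^q \lambda(dx) \right)^{\frac{1}{q}} \leq \left[ c_k \int_{\mathcal{X}} f_k(x)^2 \lambda(dx) \right] \left( c \int_{\mathcal{X}} f(x) \lambda(dx) \right) = (c_k c) \int_{\mathcal{X}} f_k(x)^2 \lambda(dx) = C_k D_k.$$

Now, let us combine this inequality with the reverse one by a union bound argument; we have:

$$P^{\otimes N} \left[ \left| \frac{1}{N} \sum_{i=1}^{N} f_k(X_i) - P[f_k(X)] \right| \geq 2 \sqrt{\frac{\eta_k \left\{ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + C_k D_k \right\}}{N}} \right] \leq 2 \exp(1 - \eta_k).$$

We now make a union bound on $k \in \{1, \dots, m\}$ and put:

$$\eta_k = 1 + \log\frac{2m}{\varepsilon}.$$

We obtain:

$$P^{\otimes N} \left[ \forall k \in \{1, \dots, m\}, \ \left| \frac{1}{N} \sum_{i=1}^{N} f_k(X_i) - P[f_k(X)] \right| \leq 2 \sqrt{\frac{\left(1 + \log\frac{2m}{\varepsilon}\right) \left\{ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + C_k D_k \right\}}{N}} \right] \geq 1 - \varepsilon.$$

We end the proof by noting that:

$$d^2(\hat{\alpha}_k f_k, \overline{\alpha}_k f_k) = \frac{\left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i) - P[f_k(X)] \right]^2}{\int_{\mathcal{X}} f_k(x)^2 \lambda(dx)}.$$

□

3. Some examples with rates of convergence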

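3.1. General remarks when $(f_k)_k$ is an orthonormal family and condition $\mathcal{H}(1)$ is satisfied. In subsections 3.1, 3.2 and 3.3 we study the rate of convergence of our estimator in the special case where $(f_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis of $\mathcal{L}^2$, so we have:

$$D_k = \int_{\mathcal{X}} f_k(x)^2 \lambda(dx) = 1$$

and:

$$\int_{\mathcal{X}} f_k(x) f_{k'}(x) \lambda(dx) = 0$$

if $k \neq k'$.

We also assume that condition $\mathcal{H}(1)$ is satisfied: $\forall x \in \mathcal{X}$, $f(x) \leq c$. Remember that in this case we have taken $c_k = 1$ and so $C_k = c$, so:

$$\beta(\varepsilon, k) = \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N} \left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + c \right].$$

Note that in this case the order of application of the projections $\Pi_{\mathcal{CR}_{k,\varepsilon}}$ does not matter because these projections work on orthogonal directions. So we can define, once $m$ is chosen:

$$\hat{f} = \Pi_{\mathcal{CR}_{m,\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0 = \Pi_{\mathcal{CR}_{\{1,\dots,m\},\varepsilon}} 0 = \hat{f}_{\{1,\dots,m\}}$$

(following the notations of subsection 2.4). Note that:

$$\hat{f}(x) = \sum_{k=1}^{m} \mathrm{sign}(\hat{\alpha}_k) \left( |\hat{\alpha}_k| - \sqrt{\beta(\varepsilon, k)} \right)_+ f_k(x)$$

where $\mathrm{sign}(x)$ is the sign of $x$ (namely $+1$ if $x > 0$ and $-1$ otherwise), and so $\hat{f}$ is a soft-thresholded estimator. Let us also make the following remark. As for any $x$, $f(x) \leq c$, we have:

$$d^2(f, 0) \leq c.$$

So the region:

$$\mathcal{B} = \left\{ g \in \mathcal{L}^2: \ \forall k \in \mathbb{N}^*, \ \left| \int_{\mathcal{X}} g(x) f_k(x) \lambda(dx) \right| \leq \sqrt{c} \right\}$$

is convex, and contains $f$. So the projection on $\mathcal{B}$, $\Pi_{\mathcal{B}}$, can only improve $\hat{f}$. We put:

$$\tilde{f} = \Pi_{\mathcal{B}} \hat{f}.$$

Note that this transformation is needed to obtain the following theorem, but does not have practical incidence in general. Actually:

$$\tilde{f}(x) = \sum_{k=1}^{m} \mathrm{sign}(\hat{\alpha}_k) \left[ \left( |\hat{\alpha}_k| - \sqrt{\beta(\varepsilon, k)} \right)_+ \wedge \sqrt{c} \right] f_k(x).$$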
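
In the orthonormal case the whole procedure therefore acts coefficient by coefficient. A minimal sketch follows, where the function name `thresholded_coefficients` is ours (hypothetical), `alpha_hat` and `beta` are arrays holding the estimated coefficients and the bounds $\beta(\varepsilon, k)$, and `c` is the known bound on $f$:

```python
import numpy as np

def thresholded_coefficients(alpha_hat, beta, c):
    """Soft-threshold each coefficient at sqrt(beta_k), then cap it at sqrt(c)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    shrunk = np.maximum(np.abs(alpha_hat) - np.sqrt(beta), 0.0)
    return np.sign(alpha_hat) * np.minimum(shrunk, np.sqrt(c))
```

For instance, with thresholds $\sqrt{\beta(\varepsilon,k)} = 0.2$ and $c = 4$, the coefficients $(1, -0.5, 0.05, 3)$ are mapped to $(0.8, -0.3, 0, 2)$: small coefficients are set to zero and very large ones are capped at $\sqrt{c}$.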

3.2. Rate of convergence in Sobolev spaces. It is well known that if $f$ has regularity $\beta$ (known by the statistician) then the choice

$$m = N^{\frac{1}{2\beta+1}}$$

and a standard estimation of coefficients leads to the optimal rate of convergence:

$$N^{\frac{-2\beta}{2\beta+1}}.$$

Here, we assume that we don't know $\beta$, and we show that taking $m = N$ leads to the rate of convergence:

$$\left( \frac{N}{\log N} \right)^{\frac{-2\beta}{2\beta+1}},$$

namely the optimal rate of convergence up to a log factor.

Theorem 3.1. Let us assume that $(f_k)_{k \in \mathbb{N}^*}$ is an orthonormal basis of $\mathcal{L}^2$. Let us put:

$$\overline{f}_m = \arg\min_{g \in \mathrm{Span}(f_1, \dots, f_m)} d^2(g, f),$$

and let us assume that $f \in \mathcal{L}^2$ satisfies condition $\mathcal{H}(1)$ and is such that there are unknown constants $D > 0$ and $\beta \geq 1$ such that:

$$d^2(\overline{f}_m, f) \leq D m^{-2\beta}.$$

Let us choose $m = N$ and $\varepsilon = N^{-2}$ in the definition of $\tilde{f}$. Then we have, for any $N \geq 2$:

$$P^{\otimes N} d^2(\tilde{f}, f) \leq D'(c, D) \left( \frac{\log N}{N} \right)^{\frac{2\beta}{2\beta+1}}.$$

Here again, the proofs of the theorems are given at the end of the section. Let us just remark that, in the case where $\mathcal{X} = [0, 1]$, $\lambda$ is the Lebesgue measure, and $(f_k)_{k \in \mathbb{N}^*}$ is the trigonometric basis, the condition:

$$d^2(\overline{f}_m, f) \leq D m^{-2\beta}$$

is satisfied for $D = D(\beta, L) = L^2 \pi^{-2\beta}$ as soon as $f \in W(\beta, L)$ where $W(\beta, L)$ is the Sobolev class:

$$\left\{ f \in \mathcal{L}^2: \ f^{(\beta-1)} \text{ is absolutely continuous and } \int_0^1 f^{(\beta)}(x)^2 \lambda(dx) \leq L^2 \right\},$$

see Tsybakov for example. The minimax rate of convergence in $W(\beta, L)$ is $N^{-\frac{2\beta}{2\beta+1}}$, so we can see that our estimator reaches the best rate of convergence up to a $\log N$ factor with an unknown $\beta$.



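3.3. Rate of convergence in Besov spaces. We here extend the previous result to the case of a Besov space $B_{s,p,q}$. Note that we have, for any $L > 0$ and $\beta > 0$:

$$W(\beta, L) \subset B_{\beta,2,2}$$

so this result is really an extension of the previous one (see Härdle, Kerkyacharian, Picard and Tsybakov, or Donoho, Johnstone, Kerkyacharian and Picard). We define the Besov space:

$$B_{s,p,q} = \left\{ g: [0,1] \rightarrow \mathbb{R}, \ g(.) = \alpha \phi(.) + \sum_{j=0}^{\infty} \sum_{k=1}^{2^j} \beta_{j,k} \psi_{j,k}(.), \ \sum_{j=0}^{\infty} 2^{jq\left(s + \frac{1}{2} - \frac{1}{p}\right)} \left[ \sum_{k=1}^{2^j} |\beta_{j,k}|^p \right]^{\frac{q}{p}} = \|g\|_{s,p,q}^q < +\infty \right\},$$

with obvious changes for $p = +\infty$ or $q = +\infty$. We also define the weak Besov space:

$$W_{\rho,\pi} = \left\{ g: [0,1] \rightarrow \mathbb{R}, \ g(.) = \alpha \phi(.) + \sum_{j=0}^{\infty} \sum_{k=1}^{2^j} \beta_{j,k} \psi_{j,k}(.), \ \sup_{\lambda > 0} \lambda^{\rho} \sum_{j=0}^{\infty} 2^{j\left(\frac{\pi}{2} - 1\right)} \sum_{k=1}^{2^j} \mathbb{1}(|\beta_{j,k}| > \lambda) < +\infty \right\}$$

$$= \left\{ g: [0,1] \rightarrow \mathbb{R}, \ g(.) = \alpha \phi(.) + \sum_{j=0}^{\infty} \sum_{k=1}^{2^j} \beta_{j,k} \psi_{j,k}(.), \ \sup_{\lambda > 0} \lambda^{\rho - \pi} \sum_{j=0}^{\infty} 2^{j\left(\frac{\pi}{2} - 1\right)} \sum_{k=1}^{2^j} |\beta_{j,k}|^{\pi} \mathbb{1}(|\beta_{j,k}| \leq \lambda) < +\infty \right\},$$

see Cohen [2] for the equivalence of both definitions. Let us remark that $B_{s,p,q}$ is a set of functions with regularity $s$ while $W_{\rho,\pi}$ is a set of functions with regularity:

$$\frac{1}{2} \left( \frac{\pi}{\rho} - 1 \right).$$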

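Theorem 3.2. Let us assume that $\mathcal{X} = [0, 1]$, and that $(\psi_{j,k})_{j = 0, \dots, +\infty, \ k \in \{1, \dots, 2^j\}}$ is a wavelet basis, together with a function $\phi$, satisfying the conditions given in Donoho, Johnstone, Kerkyacharian and Picard and having regularity $R$ (for example Daubechies' families), with $\phi$ and $\psi_{0,1}$ supported by $[-A, A]$. Let us assume that $f \in B_{s,p,q}$ with $R + 1 \geq s > \frac{1}{p}$, $1 \leq q \leq \infty$, $2 \leq p \leq +\infty$, or that $f \in B_{s,p,q} \cap W_{\frac{2}{2s+1}, 2}$ with $R + 1 \geq s > \frac{1}{p}$, $1 \leq p \leq +\infty$, with unknown constants $s$, $p$ and $q$, and that $f$ satisfies condition $\mathcal{H}(1)$ with a known constant $c$. Let us choose:

$$\{f_1, \dots, f_m\} = \{\phi\} \cup \left\{ \psi_{j,k}, \ j = 0, \dots, \left\lfloor \frac{\log N}{\log 2} \right\rfloor - 1, \ k = 1, \dots, 2^j \right\}$$

(so $\frac{N}{2} \leq m \leq N$) and $\varepsilon = N^{-2}$ in the definition of $\tilde{f}$. Then we have:

$$P^{\otimes N} d^2(\tilde{f}, f) = O\left( \left( \frac{\log N}{N} \right)^{\frac{2s}{2s+1}} \right).$$

Let us remark that we obtain nearly the same rate of convergence as in Donoho, Johnstone, Kerkyacharian and Picard, namely the minimax rate of convergence up to a log factor.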

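3.4. Kernel estimators. Here, we assume that $\mathcal{X} = \mathbb{R}$ and that $f$ is compactly supported, say by $[0, 1]$. We put, for any $m \in \mathbb{N}$ and $k \in \{1, \dots, m\}$:

$$f_k(x) = K\left( \frac{k}{m}, x \right)$$

where $K$ is some function $\mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$, and we obtain some estimator that has the form of a kernel estimator:

$$\hat{f}_{\{1,\dots,m\}}(x) = \sum_{k=1}^{m} \tilde{\alpha}_k K\left( \frac{k}{m}, x \right).$$

Moreover, it is possible to use a multiple kernel estimator. Let us choose $n \in \mathbb{N}$, $h \in \mathbb{N}$, $h$ kernels $K_1, \dots, K_h$ and put, for any $k = i + n * j \in \{1, \dots, m = hn\}$:

$$f_k(x) = K_j\left( \frac{i}{n}, x \right).$$

We obtain a multiple kernel estimator:

$$\hat{f}_{\{1,\dots,m\}}(x) = \sum_{i=1}^{n} \sum_{j=1}^{h} \tilde{\alpha}_{i+nj} K_j\left( \frac{i}{n}, x \right).$$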

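3.5. Proof of the theorems.

Proof of theorem 3.1. Let us begin the proof with a general $m$ and $\varepsilon$; the reason for the choice $m = N$ and $\varepsilon = N^{-2}$ will become clear. Let us also write $\mathcal{E}(\varepsilon)$ for the event satisfied with probability at least $1 - \varepsilon$ in theorem 2.1. We have:

$$P^{\otimes N} d^2(\tilde{f}, f) = P^{\otimes N} \left[ \left( 1 - \mathbb{1}_{\mathcal{E}(\varepsilon)} \right) d^2(\tilde{f}, f) \right] + P^{\otimes N} \left[ \mathbb{1}_{\mathcal{E}(\varepsilon)} d^2(\tilde{f}, f) \right].$$

For the first term we have:

$$d^2(\tilde{f}, f) \leq 2 \int_{\mathcal{X}} f(x)^2 \lambda(dx) + 2 \int_{\mathcal{X}} \tilde{f}(x)^2 \lambda(dx) \leq 2c + 2mc = 2(m+1)c$$

and so:

$$P^{\otimes N} \left[ \left( 1 - \mathbb{1}_{\mathcal{E}(\varepsilon)} \right) d^2(\tilde{f}, f) \right] \leq 2\varepsilon(m+1)c.$$

For the other term, just remark that under $\mathcal{E}(\varepsilon)$:

$$d^2(\tilde{f}, f) = d^2(\Pi_{\mathcal{B}} \Pi_{\mathcal{CR}_{m,\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0, f) \leq d^2(\Pi_{\mathcal{CR}_{m,\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0, f) \leq d^2(\Pi_{\mathcal{CR}_{m',\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0, f)$$

for any $m' \leq m$, because of theorem 2.1, more precisely of corollary 2.3. And we have:

$$d^2(\Pi_{\mathcal{CR}_{m',\varepsilon}} \dots \Pi_{\mathcal{CR}_{1,\varepsilon}} 0, f) \leq \sum_{k=1}^{m'} \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N} \left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + c \right] + d^2(\overline{f}_{m'}, f).$$

So we have:

$$P^{\otimes N} \left[ \mathbb{1}_{\mathcal{E}(\varepsilon)} d^2(\tilde{f}, f) \right] \leq P^{\otimes N} \left\{ \sum_{k=1}^{m'} \frac{4\left[1 + \log\frac{2m}{\varepsilon}\right]}{N} \left[ \frac{1}{N} \sum_{i=1}^{N} f_k(X_i)^2 + c \right] \right\} + (m')^{-2\beta} D.$$

So finally, we obtain, for any $m' \leq m$:

$$P^{\otimes N} d^2(\tilde{f}, f) \leq \frac{8 m' c \left[1 + \log\frac{2m}{\varepsilon}\right]}{N} + (m')^{-2\beta} D + 2\varepsilon(m+1)c.$$

The choice of:

$$m' = \left( \frac{N}{\log N} \right)^{\frac{1}{2\beta+1}}$$

leads to a first term of order $N^{\frac{-2\beta}{2\beta+1}} \log\frac{m}{\varepsilon} (\log N)^{\frac{-1}{2\beta+1}}$ and a second term of order $N^{\frac{-2\beta}{2\beta+1}} (\log N)^{\frac{2\beta}{2\beta+1}}$. The choice of $m = N$ and $\varepsilon = N^{-2}$ gives a first and second term at order:

$$\left( \frac{\log N}{N} \right)^{\frac{2\beta}{2\beta+1}}$$

while keeping the third term at order $N^{-1}$. This proves the theorem. □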

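Proof of theorem 3.2. Here again let us write $\mathcal{E}(\varepsilon)$ for the event satisfied with probability at least $1 - \varepsilon$ in theorem 2.1. We have:

$$P^{\otimes N} d^2(\tilde{f}, f) = P^{\otimes N} \left[ \left( 1 - \mathbb{1}_{\mathcal{E}(\varepsilon)} \right) d^2(\tilde{f}, f) \right] + P^{\otimes N} \left[ \mathbb{1}_{\mathcal{E}(\varepsilon)} d^2(\tilde{f}, f) \right].$$

For the first term we still have:

$$d^2(\tilde{f}, f) \leq 2(m+1)c.$$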
For the second term, let us write the development of / into our wavelet basis: 

oo 2^ 



and: 



j=0 fc=l 
J 2J 



f{x) = 50 + ^ ^ Pj,k'ipJ,k 
j=0 k=l 



the estimator /. Let us put: 



J 



log TV 
log2. 



For any J' < J we have: 

dHfJ) = d\nBllcn„. ^...Ilcn,,Af) < d^(ncn„...■■.Tlcn^,A f) 

J 2' oo 2^ 

= (a-a)2+^^(/3,,,-/3,,.)2+ ^ 

j=0 k=l j = .J+l k=l 

J' 2' J' 2^ 

<{a- ar+Y,T.(hk - f3j,krtm,k\ > n) + EE^I.'cid^j-^fci < 

j=0 k=l j=0 k=l 

oo 2' 

+ E E^h 

j=J' + l k=l 

for any k > 0, as soon as £{e) is satisfied (here again we applied theorem I2.1|l . In 
the case where p> 2 we can take: 



J' 



log A^i+2» 
log 2 



and K = to obtain (let C be a generic constant in the whole proof): 

2J 



oo 2^ oo 

E T.^h^ E IE/51.1 2' 

j=J' + lfe=l j = J' + l \k=l 

As / G Ss.p,? C i?s,p, oo we have: 



^■(1-1) 



E/3^J < C2-2^(^+3-i) 



,k=l 



and so: 



j=J+l k=l 
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and: 

J 2^ o Fi I 1 2m] J 2J 

j=0 fc=l j"=0 fe=l 

- TV - TV 

So we obtain the desired rate of convergence. In the case where p < 2 we let J' = J 
and proceed as follows. 

j=o fc=i i=o fe=i 

8c[l + log2f] 

< —Ck 2s + 1 

TV 

because / is also assumed to be in the weak Besov space. We also have: 

EE'3'fei(i/^^-.'^i<«^)^^'^'"'^- 

j=0 k=l 

For the remainder term we use (see jlfll l9]l: 

Bs,p,q C 2,q 

to obtain: 

oo 2^' 

j=j+i 

as s > i. Let us remember that: 

p 

TV , 
— < m = 2'^ < TV 
2 - 



and that e — N , and take: 



/log TV 

to obtain the desired rate of convergence. □ 



4. Better bounds and generalizations 

Actually, as pointed out by Catoni, the symmetrization technique used in the proof of theorem 2.1 causes the loss of a factor 2 in the bound because we upper bound the variance of two samples instead of one. In this section, we try to use this remark to improve our bound, using techniques already used by Catoni. We also give a generalization of the obtained result that allows us to use a family $(f_1, \dots, f_m)$ of functions that is data-dependent. The technique used is due to Seeger, and it will allow us to use kernel estimators such as Support Vector Machines.

Remark that the estimation technique described in section 2 does not necessarily require a bound on $d^2(\hat{\alpha}_k f_k, \overline{\alpha}_k f_k)$. Actually, a simple confidence interval on $\overline{\alpha}_k$ is sufficient.






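4.1. An improvement of theorem 2.1 under condition $\mathcal{H}(+\infty)$. Let us remember that $\mathcal{H}(+\infty)$ just means that every $|f_k|$ is bounded by $\sqrt{C_k D_k}$.

Theorem 4.1. Under condition $\mathcal{H}(+\infty)$, for any $\varepsilon > 0$, for any $\beta_{k,1}, \beta_{k,2}$ such that:

$$0 < \beta_{k,j} < \frac{N}{\sqrt{C_k D_k}}, \quad j \in \{1, 2\},$$

with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $k \in \{1, \dots, m\}$ we have:

$$\alpha_k^{\inf}(\varepsilon, \beta_{k,1}) \leq \overline{\alpha}_k \leq \alpha_k^{\sup}(\varepsilon, \beta_{k,2})$$

with:

$$\alpha_k^{\sup}(\varepsilon, \beta) = \frac{N - N \exp\left[ \frac{1}{N} \sum_{i=1}^{N} \log\left( 1 - \frac{\beta}{N} f_k(X_i) \right) - \frac{\log\frac{2m}{\varepsilon}}{N} \right]}{D_k \beta}$$

and:

$$\alpha_k^{\inf}(\varepsilon, \beta) = \frac{N \exp\left[ \frac{1}{N} \sum_{i=1}^{N} \log\left( 1 + \frac{\beta}{N} f_k(X_i) \right) - \frac{\log\frac{2m}{\varepsilon}}{N} \right] - N}{D_k \beta}.$$

Before we give the proof, let us see why this theorem really improves theorem 2.1. Let us put:

$$V_k = P\left\{ \left[ f_k(X) - P(f_k(X)) \right]^2 \right\}$$

and: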

Then we obtain: 



Pk,l — /3fe,2 — 



'TV log ^ 



Vk 



and: 



'2V^felog^ 



N 



' 2Ffclog ^ 
N 



Of 



Of 



log^ 



N 



log^ 
N 



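So, the first order term for $d^2(\hat{\alpha}_k f_k, \overline{\alpha}_k f_k)$ is:

$$\frac{2 V_k \log\frac{2m}{\varepsilon}}{N},$$

there is an improvement by a factor 4 when we compare this bound to theorem 2.1. Remark that this particular choice for $\beta_{k,1}$ and $\beta_{k,2}$ is valid as soon as:

$$\sqrt{\frac{N \log\frac{2m}{\varepsilon}}{V_k}} < \frac{N}{\sqrt{C_k D_k}}$$

or equivalently as soon as $N$ is greater than

$$\frac{C_k D_k \log\frac{2m}{\varepsilon}}{V_k}.$$

In practice, however, these particular $\beta_{k,1}$ and $\beta_{k,2}$ are unknown. We can use the following procedure (see Catoni). We choose a value $a > 1$ and:

$$B = \left\{ a^l, \ 0 \leq l \leq \left\lfloor \frac{\log\frac{N}{\sqrt{C_k D_k}}}{\log a} \right\rfloor - 1 \right\}.$$

By taking a union bound over all possible values in $B$, with:

$$|B| \leq \frac{\log\frac{N}{\sqrt{C_k D_k}}}{\log a},$$

we obtain the following corollary.

Corollary 4.2. Under condition $\mathcal{H}(+\infty)$, for any $a > 1$, for any $\varepsilon > 0$, with $P^{\otimes N}$-probability at least $1 - \varepsilon$ we have:

$$\sup_{\beta \in B} \alpha_k^{\inf}\left( \frac{\varepsilon \log a}{\log N - \frac{1}{2} \log C_k D_k}, \beta \right) \leq \overline{\alpha}_k \leq \inf_{\beta \in B} \alpha_k^{\sup}\left( \frac{\varepsilon \log a}{\log N - \frac{1}{2} \log C_k D_k}, \beta \right)$$

with:

$$B = \left\{ a^l, \ 0 \leq l \leq \left\lfloor \frac{\log\frac{N}{\sqrt{C_k D_k}}}{\log a} \right\rfloor - 1 \right\}.$$

Note that the price to pay for the optimization with respect to $\beta_{k,1}$ and $\beta_{k,2}$ was just a log log factor.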
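
The confidence interval of theorem 4.1 is explicit and can be evaluated directly for a fixed pair $(\beta_{k,1}, \beta_{k,2})$. The sketch below is ours: the function name `coefficient_interval` and its arguments are hypothetical, `fk_values` is assumed to hold $f_k(X_1), \dots, f_k(X_N)$, and both parameters are assumed to lie in $(0, N/\sqrt{C_k D_k})$ so that the logarithms are well defined. The bounds are the ones obtained in the proof given next, written with `log1p` for numerical stability.

```python
import numpy as np

def coefficient_interval(fk_values, d_k, beta_low, beta_up, m, eps):
    """Return (lower, upper) bounds on the true coefficient of model k.

    upper = (N - N exp[ mean_i log(1 - beta f_k(X_i)/N) - log(2m/eps)/N ]) / (D_k beta)
    lower = (N exp[ mean_i log(1 + beta f_k(X_i)/N) - log(2m/eps)/N ] - N) / (D_k beta)
    """
    fk_values = np.asarray(fk_values, dtype=float)
    n = fk_values.size
    penalty = np.log(2.0 * m / eps) / n
    up_exp = np.exp(np.mean(np.log1p(-beta_up / n * fk_values)) - penalty)
    low_exp = np.exp(np.mean(np.log1p(beta_low / n * fk_values)) - penalty)
    upper = (n - n * up_exp) / (d_k * beta_up)
    lower = (n * low_exp - n) / (d_k * beta_low)
    return lower, upper
```

In practice one would evaluate this function on every value of the grid $B$ and keep the largest lower bound and the smallest upper bound, with $\varepsilon$ corrected as in corollary 4.2.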

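Proof of the theorem. The technique used in the proof is due to Catoni. Let us choose $k \in \{1, \dots, m\}$, and:

$$\beta \in \left( 0, \frac{N}{\sqrt{C_k D_k}} \right).$$

We have, for any $\eta \in \mathbb{R}$:

$$P^{\otimes N} \exp\left\{ \sum_{i=1}^{N} \log\left( 1 - \frac{\beta}{N} f_k(X_i) \right) - \eta \right\} \leq \exp\left\{ N \log\left( 1 - \frac{\beta}{N} P[f_k(X)] \right) - \eta \right\}.$$

Let us choose:

$$\eta = \log\frac{2m}{\varepsilon} + N \log\left( 1 - \frac{\beta}{N} P[f_k(X)] \right).$$

We obtain:

$$P^{\otimes N} \exp\left\{ \sum_{i=1}^{N} \log\left( 1 - \frac{\beta}{N} f_k(X_i) \right) - \log\frac{2m}{\varepsilon} - N \log\left( 1 - \frac{\beta}{N} P[f_k(X)] \right) \right\} \leq \frac{\varepsilon}{2m},$$

and so:

$$P^{\otimes N} \left\{ \sum_{i=1}^{N} \log\left( 1 - \frac{\beta}{N} f_k(X_i) \right) \geq \log\frac{2m}{\varepsilon} + N \log\left( 1 - \frac{\beta}{N} P[f_k(X)] \right) \right\} \leq \frac{\varepsilon}{2m},$$

that becomes:

$$P^{\otimes N} \left\{ P[f_k(X)] \geq \frac{N}{\beta} \left[ 1 - \exp\left( \frac{1}{N} \sum_{i=1}^{N} \log\left( 1 - \frac{\beta}{N} f_k(X_i) \right) - \frac{\log\frac{2m}{\varepsilon}}{N} \right) \right] \right\} \leq \frac{\varepsilon}{2m}.$$

This gives the upper bound on $\overline{\alpha}_k = P[f_k(X)] / D_k$. We apply the same technique to:

$$P^{\otimes N} \exp\left\{ \sum_{i=1}^{N} \log\left( 1 + \frac{\beta}{N} f_k(X_i) \right) - \eta \right\} \leq \exp\left\{ N \log\left( 1 + \frac{\beta}{N} P[f_k(X)] \right) - \eta \right\}$$

to obtain the lower bound. We combine both results by a union bound argument.

□

4.2. A generalization to data-dependent basis functions. We now extend the previous method to the case where the family $(f_1, \dots, f_m)$ is allowed to be data-dependent, in a particular sense. This subsection requires some modifications of the notations of section 2.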

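Definition 4.1. For any $m' \in \mathbb{N}^*$ we define a function $\Theta_{m'}: \mathcal{X} \rightarrow (\mathcal{L}^2)^{m'}$. For any $i \in \{1, \dots, N\}$ we put:

$$\Theta_{m'}(X_i) = (f_{i,1}, \dots, f_{i,m'}).$$

Finally, consider the family of functions:

$$(f_1, \dots, f_m) = (f_{1,1}, \dots, f_{1,m'}, \dots, f_{N,1}, \dots, f_{N,m'}).$$

So we have $m = m' N$ (of course, $m'$ is allowed to depend on $N$). Let us take, for any $i \in \{1, \dots, N\}$:

$$P_i(.) = P^{\otimes N}(. | X_i).$$

We put, for any $(i, k) \in \{1, \dots, N\} \times \{1, \dots, m'\}$:

$$D_{i,k} = \int_{\mathcal{X}} f_{i,k}(x)^2 \lambda(dx),$$

and we still assume that condition $\mathcal{H}(\infty)$ is satisfied, which means here that we have known constants $C_{i,k} = c_{i,k}$ such that:

$$\forall x \in \mathcal{X}, \quad |f_{i,k}(x)| \leq \sqrt{C_{i,k} D_{i,k}}.$$

Finally, we put:

$$\overline{\alpha}_{i,k} = \arg\min_{\alpha \in \mathbb{R}} d^2(\alpha f_{i,k}, f).$$

Let us choose $(i, k) \in \{1, \dots, N\} \times \{1, \dots, m'\}$. Using Seeger's idea, we follow the preceding proof, replacing $P^{\otimes N}$ by $P_i$, and using the $N - 1$ random variables:

$$\left( f_{i,k}(X_j) \right)_{j \in \{1, \dots, N\}, \ j \neq i}$$

with

$$\eta = \log\frac{2m}{\varepsilon} + (N-1) \log\left( 1 - \frac{\beta}{N-1} P[f_{i,k}(X)] \right)$$

and we obtain:

$$P_i \exp\left\{ \sum_{j \neq i} \log\left( 1 - \frac{\beta}{N-1} f_{i,k}(X_j) \right) - \log\frac{2m}{\varepsilon} - (N-1) \log\left( 1 - \frac{\beta}{N-1} P[f_{i,k}(X)] \right) \right\} \leq \frac{\varepsilon}{2m}.$$

Note that for any random variable $H$ that is a function of the $X_i$:

$$P^{\otimes N} P_i H = P^{\otimes N} H.$$

So we conclude exactly in the same way as for the previous theorem and we obtain the following result.

Theorem 4.3. For any $\varepsilon > 0$, for any $\beta_{i,k,1}, \beta_{i,k,2}$ such that:

$$0 < \beta_{i,k,j} < \frac{N-1}{\sqrt{C_{i,k} D_{i,k}}}, \quad j \in \{1, 2\},$$

with $P^{\otimes N}$-probability at least $1 - \varepsilon$, for any $i \in \{1, \dots, N\}$ and $k \in \{1, \dots, m'\}$ we have:

$$\alpha_{i,k}^{\inf}(\varepsilon, \beta_{i,k,1}) \leq \overline{\alpha}_{i,k} \leq \alpha_{i,k}^{\sup}(\varepsilon, \beta_{i,k,2})$$

with:

$$\alpha_{i,k}^{\sup}(\varepsilon, \beta) = \frac{N - 1 - (N-1) \exp\left[ \frac{1}{N-1} \sum_{j \neq i} \log\left( 1 - \frac{\beta}{N-1} f_{i,k}(X_j) \right) - \frac{\log\frac{2m}{\varepsilon}}{N-1} \right]}{D_{i,k} \beta}$$

and:

$$\alpha_{i,k}^{\inf}(\varepsilon, \beta) = \frac{(N-1) \exp\left[ \frac{1}{N-1} \sum_{j \neq i} \log\left( 1 + \frac{\beta}{N-1} f_{i,k}(X_j) \right) - \frac{\log\frac{2m}{\varepsilon}}{N-1} \right] - N + 1}{D_{i,k} \beta}.$$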



Example 4.1 (Support Vector Machines). Actually, SVM were first introduced by Guyon, Boser and Vapnik in the context of classification, but the method was extended by Vapnik [15] to the context of least square regression estimation and of density estimation. The idea is to generalize the kernel estimator to the case where $\mathcal{X}$ is of large dimension, so that we cannot use a grid like we did in the $[0, 1]$ case. Let us choose a function:

$$K: \mathcal{X}^2 \rightarrow \mathbb{R}, \quad (x, x') \mapsto K(x, x').$$

We take $m' = 1$ and:

$$\Theta_1(x) = (K(x, .));$$

then the obtained estimator has the form of a SVM:

$$\hat{f}(x) = \sum_{i=1}^{N} \tilde{\alpha}_i K(X_i, x)$$

where the set of $i$ such that $\tilde{\alpha}_i \neq 0$ is expected to be small. Note that we do not need to assume that $K(., .)$ is a Mercer's kernel as usual with SVM. Moreover, we can extend the method to the case where we have several kernels $K_1, \dots, K_{m'}$ by taking:

$$\Theta_{m'}(x) = (K_1(x, .), \dots, K_{m'}(x, .)).$$

The estimator becomes:

$$\hat{f}(x) = \sum_{j=1}^{m'} \sum_{i=1}^{N} \tilde{\alpha}_{i,j} K_j(X_i, x).$$

Note that a widely used kernel is the gaussian kernel; let $\delta(., .)$ be a distance on $\mathcal{X}$ and $\gamma_1, \dots, \gamma_{m'} > 0$, then we put:

$$K_k(x, x') = \exp\left( -\gamma_k \delta^2(x, x') \right).$$
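
To fix ideas, the data-dependent gaussian family and the resulting estimator can be evaluated as follows. This is only a sketch under assumptions of ours: $\mathcal{X} = \mathbb{R}$, $\delta(x, x') = |x - x'|$, and the coefficients `coefs` (one per kernel and per observation) are taken as given, since their computation is the object of the selection algorithm. The names `gaussian_dictionary` and `kernel_estimate` are hypothetical.

```python
import numpy as np

def gaussian_dictionary(samples, gammas, x):
    """Values K_k(X_i, x_l) = exp(-gamma_k (x_l - X_i)^2), shape (m', N, len(x))."""
    samples = np.asarray(samples, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    x = np.asarray(x, dtype=float)
    sq_dist = (x[None, :] - samples[:, None]) ** 2
    return np.exp(-gammas[:, None, None] * sq_dist[None, :, :])

def kernel_estimate(coefs, samples, gammas, x):
    """Evaluate sum_k sum_i coefs[k, i] K_k(X_i, x) at the points x."""
    values = gaussian_dictionary(samples, gammas, x)
    return np.tensordot(np.asarray(coefs, dtype=float), values, axes=([0, 1], [0, 1]))
```

Only the observations with a nonzero coefficient contribute to the sum, which is why a sparse coefficient array gives an estimator that is cheap to evaluate.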

For example, if $\mathcal{X} = \mathbb{R}$ and $\lambda$ is the Lebesgue measure then hypothesis $\mathcal{H}(\infty)$ is obviously satisfied with the gaussian kernel with

4.3. Back to the histogram. In the case of the histogram, $f_k(.) = \mathbb{1}_{A_k}(.)$ can take only two values: 0 and 1. Remember that $D_k = \lambda(A_k)$. So:



N 



A(^fc)/3fe,i 
Remember that, for any a; > 0: 



1 + 



N 



\{i:X,(iA^}\ 



e 

2m 



- 1 



and so: 



1 



Now, we take the grid: 

B = |2',0 < I < 
Remark that, for any /3 in: 

1, 



2N 



log ^ 



N 



DkPk, 



\2mJ 



log 2 



- 1 



N 



there is some b e B such that (i <b < 2(3, and so: 



-ris,b)>ak{^y 



2N 



N 



Dk2Pk- 



This allows us to choose whatever value for pk,i in 

N 



1, 



2vd:. 



Let us choose: 



I3k, 



7V2 



(2m) 



\ dkDk{l - otkDk) 



that is allowed for A'' large enough. So we have: 



af{£,pk,i) > ak (^) " - )^dkDk{l - dkDk) 
With the union bound term (over the grid B) we obtain: 

,inf ( '^logZ 



\2m) 



> dk 



e\og2 



2rnlog 



N 



\ 



dkDk{l - dkDk) 



e\og2 



2m log 



N 



= oik- 



i 



2m log 

dkDkil - dkDk) log ,j,gf^ ^ ^ /log iTiMiv 



TV 



TV 



Remark that this time we have the "real" variance term of $\mathbb{1}_{A_k}(X)$, namely $\hat{\alpha}_k D_k (1 - \hat{\alpha}_k D_k)$.

4.4. Another simple example: the Haar basis. Let us assume that $\mathcal{X} = [0, 1]$. Let $(\phi, \psi)$ be a father wavelet and the associated mother wavelet, and:

for k e {0, ...,2-' — 1} = Sj (note that the wavelet basis is non-normalized here). 
Here, we use the Haar wavelets, with: 

ip{x) = ll[o,i](a;) 
For the sake of simplicity, let us write: 

for k € {0} = S-i. By an obvious adaptation of our notations, let us put aj^k the 
coefficient associated to V'j,/s- 



remark that condition H{oo) is satisfied with Dj^k = S^-' and Cj,k = 1- In this 
particular setting, note that a_i,o = 1 is known, so the associated confidence 
interval is just {!}. Moreover, here V'j,fe(^) can take only three values: —1, and 
1. Let us put: 

1 ^ 



N 



i=l 



Remark that in this case we have: 

N 



1 

N 



P(^,-,(X) = l)log (^1 

= ip[^,,,(X)2]log(l-^) +ip[V',,.(X)]logfi^j . 



So we have: 
N - iVexp 



ip log (i - ^) - [i^j,km log (^4) - 



N 



k,2 



and: 



inf 



Nexp 



\p log (i - ^) + \p Vl^jAX)] log (i^) - 



log ^ 



N 



-N 



5. Simulations 

5.1. Description of the example. We assume that we observe $X_i$ for $i \in \{1, \dots, N\}$ with $N = 2^{10} = 1024$, where the variables $X_i \in [0, 1] \subset \mathbb{R}$ are i.i.d. from a distribution with an unknown density $f$ with respect to the Lebesgue measure. The goal is to estimate $f$.

Here, we will use three methods. The first estimation method will be a multiple kernel estimator obtained by the algorithm described previously, the second one a thresholded wavelet estimate also obtained by this algorithm, and we will compare both estimators to a thresholded wavelet estimate as given by Donoho, Johnstone, Kerkyacharian and Picard.

5.2. The estimators. 

5.2.1. Hard-thresholded wavelet estimator. We first use a classical hard-thresholded wavelet estimator.

In the case of the Haar basis (see subsection 4.4), we take:

. 1 ^ 

For a given $\kappa > 0$ and $J \in \mathbb{N}$, we take:

J 



-1 fceSj 



where: 



tj,N — \l — 



Actually, we must choose J in such a way that: 



2-^ - f-^ 



Here, we choose $\kappa = 0.7$ and $J = 7$.



5.2.2. Wavelet estimators with our algorithm. We also use the same family of functions, and we apply our thresholding method, with bounds given in subsection 4.4. So we take:

$$m = 2^J = 128.$$

We use an asymptotic version of our confidence intervals inspired by our theoretical confidence intervals:



aj,k ± 



TV 



where $\hat{V}_{j,k}$ is the estimated variance of $\psi_{j,k}(X)$:

$$\hat{V}_{j,k} = \frac{1}{N} \sum_{i=1}^{N} \left[ \psi_{j,k}(X_i) - \frac{1}{N} \sum_{h=1}^{N} \psi_{j,k}(X_h) \right]^2.$$

Let us remark that union bounds are always "pessimistic", and that we use a union bound argument over all the $m$ models although only a few of them are effectively used in the estimator. So, we propose to actually use the individual confidence interval for each model, replacing the $\log\frac{2m}{\varepsilon}$ by $\log\frac{2}{\varepsilon}$.

5.2.3. Multiple kernel estimator. Finally, we use the multiple kernel estimator described in section 3.4 with functions $K_j$:

Kj{u,v) = exp [-~2^'{u - v)"^] 

with $n = N$ and $j \in \{1, \dots, h = 6\}$. We add the constant function 1 to the family.

Here again we use the individual confidence intervals, and the asymptotic version of these intervals.

Figure 1. Values of $t_i$ and $c_i$ in the function Blocks(.).

Figure 2. Results of the experiments. For each experiment, we give the mean distance of the estimator to the density ($d^2(., f)$).

Function f(.)   standard thresholded wavelets   thresh. wav. with our method   multiple kernel
Doppler         0.104                           0.127                          0.083
HeaviSine       0.071                           0.066                          0.040
Blocks          0.110                           0.142                          0.121

5.3. Experiments and results. The simulations were carried out with the R software.

For the experiments, we use the following functions $f$, which are variations of the functions used by Donoho and Johnstone for experiments on wavelets (actually, these functions were used as regression functions, so the modification was to add a constant to them in order to ensure that they take nonnegative values):

DopplerU) = 1 + 2Jt{l - t) sin '^'^^^ ^ where d = 0.05 

t + V 

HeaviSine{t) = 1.5 + ^ |^4sin47ri - sgn{t - 0.3) - sgn{0.72 - t) 
1 " 

Blocks{t) = 1-05 + - ^ c,l(t.^+oo) it) 
1=1 

where $\mathrm{sgn}(t)$ is the sign of $t$ (say $-1$ if $t < 0$ and $+1$ otherwise). The values of the $c_i$ and $t_i$ are given in figure 1.

We consider 3 experiments (one for each of the three density functions); we choose $\varepsilon = 10\%$ and repeat each experiment 20 times. The results are reported in figure 2. We also give some illustrations (figures 3, 4 and 5).

Figure 3. Experiment 1, $f$ = Doppler. Up-left: true regression function (true). Down-left: SVM ($\hat{f}$). Up-right: wavelet estimate with our algorithm (ondelrel). Down-right: "classical" wavelet estimate (ondelseu).



Figure 4. Experiment 2, $f$ = HeaviSine and a = 0.3.

Figure 5. Experiment 3, $f$ = Blocks and a = 0.3.